Welcome to BarCluster. This RMarkDown file is a tutorial using sample data to generate hybrid clusters from lineage barcoded single cell sequencing data. We will:
Read in our data (count matrix and barcode assignments).
Run PCA on our count matrix.
Create our hybrid clusters across a range of values.
Determine alpha values for analysis based on cluster number at fixed resolution.
Generate Sankey plots to show cluster rearrangement.
Identify cluster markers with ROC analysis.
The sample data provided is derived from one of our published samples, reduced in size and complexity to host on GitHub. It consists of the top 15 barcodes plus 100 singlet cells and the 501 most variable genes in replicate 1 of high dose BRAF treated WM989 cells from Goyal et al. biorXiv 2021.
Load required packages, get our tutorial data files:
library(data.table)
library(magrittr)
library(Seurat)
library(BarCluster)
library(ggplot2)
# get the location of the sample data installed with barcluster
dir <- system.file(package = "BarCluster") %>% file.path(., "extdata")
# count matrix file
cm <- file.path(dir, "YG1_sample_genes.txt")
# barcode assignment file
bt <- file.path(dir, "YG1_sample_barcodes.txt")
Now lets read in and see what our files look like.
# read in barcodes
bt <- data.table::fread(bt)
# what does it look like
head(bt)
## rn Barcode
## 1: AAACGAAAGAGGGTCT C9
## 2: AAAGGTAAGGTAGTAT C9
## 3: AAAGGTACACCATTCC C9
## 4: AAAGGTACAGTAGTGG C9
## 5: AAAGTGAAGGCCACCT C9
## 6: AACCATGGTTTGACAC C9
# read in count matrix
cm <- data.table::fread(cm)
# what does it look like
cm[1:5,1:5]
## rn AAACGAAAGAGGGTCT AAAGGTAAGGTAGTAT AAAGGTACACCATTCC AAAGGTACAGTAGTGG
## 1: ACTA2 -0.9795829 -1.5483850 0.6847103 -1.2496894
## 2: MYLK -1.8769908 -1.8002466 0.8519252 -1.3883718
## 3: S100B -1.5113756 -0.8227752 -0.9428514 -1.1884587
## 4: CRYAB -1.8161940 -1.7863511 -1.8369420 -1.4849199
## 5: CCL5 -1.1543626 -0.2587900 -1.1720407 -0.8835463
Everything read in correctly but the count matrix is formatted with cells as columns (a common output format). BarCluster expects cells as rows. No problem, I’ve included the functions tdt and dt2m in BarCluster to reformat the data.table and turn it into a matrix, respectively.
# transpose the data table we read in to get cells as rows using the tdt function
cm <- tdt(cm)
# convert data table to matrix using the dt2m function
cm <- dt2m(cm)
# what does it look like?
cm[1:5, 1:5]
## ACTA2 MYLK S100B CRYAB CCL5
## AAACGAAAGAGGGTCT -0.9795829 -1.8769908 -1.5113756 -1.816194 -1.1543626
## AAAGGTAAGGTAGTAT -1.5483850 -1.8002466 -0.8227752 -1.786351 -0.2587900
## AAAGGTACACCATTCC 0.6847103 0.8519252 -0.9428514 -1.836942 -1.1720407
## AAAGGTACAGTAGTGG -1.2496894 -1.3883718 -1.1884587 -1.484920 -0.8835463
## AAAGTGAAGGCCACCT -0.7632957 -1.6382829 -1.4243413 -1.012571 -1.0797454
The next step is to generate our PCA matrix. The function irlba_wrap is a wrapper that will do this for you. You can set the random seed and the number of output PCs if you want, see ?irlba_wrap. For this analysis, let’s summarize our 501 genes to 25 PCs:
# using 25 PCs
pca <- irlba_wrap(cm, npc = 25)
# peak at the first 5 columns and rows
pca[1:5, 1:5]
## PC_1 PC_2 PC_3 PC_4 PC_5
## AAACGAAAGAGGGTCT 22.84446 19.888593 -32.175042 32.374631 -4.999031
## AAAGGTAAGGTAGTAT 22.32305 -5.231187 -21.544455 -15.831288 -1.927493
## AAAGGTACACCATTCC 30.69033 4.068853 -27.413052 32.316806 -6.976072
## AAAGGTACAGTAGTGG 13.27684 -15.774159 -3.780693 -9.031577 -6.779002
## AAAGTGAAGGCCACCT 20.12647 -3.578279 -17.679333 -12.897062 2.571924
The steps to perform hybrid clustering are:
Generate transcriptome shared nearest neighbor graph
Generate size normalized barcode graph
Integrate the two with the BarCluster model
Repeat for many values of alpha to determine max alpha with tolerable cluster numbers.
Luckily, this is all executed with a single function whose inputs are the barcode table and the PCA we just made. The barcluster function does all this for one or more alpha values. The beta parameter will affect how each step in alpha affects the output, we are using 0.1 here, but feel free to experiment. The res argument is passed to the Seurat implementation of the Louvain algorithm. It will affect the number of clusters at low alpha, but at high alpha the determining factor is the number of barcodes in the sample. I suggest you choose a value for resolution that gives you a number of manageable clusters at alpha == 0, or simply the same value as whatever initial Seurat analysis you have performed on the data.
# lets get our range of alpha values
als <- seq(0, 1, by = 0.1)
# return the cluster assignments for range of alphas
clust <- barcluster(pca, bt, alpha = als, beta = 0.1, res = 1.5)
# lets take a peak at the output cluster assignments
head(clust)
## rn Group alpha resolution beta
## 1: AAACGAAAGAGGGTCT 8 0 1.5 0.1
## 2: AAAGGTAAGGTAGTAT 2 0 1.5 0.1
## 3: AAAGGTACACCATTCC 8 0 1.5 0.1
## 4: AAAGGTACAGTAGTGG 1 0 1.5 0.1
## 5: AAAGTGAAGGCCACCT 2 0 1.5 0.1
## 6: AACCATGGTTTGACAC 1 0 1.5 0.1
These are your hybrid clusters.
clust is a data table with the cluster assignment of each cell at all alpha values. First, lets identify points to conduct our analysis by making a plot of cluster number vs alpha.
# returns a table with column V1L number of groups at that value of alpha
nt <- clust[, Group %>% unique %>% length, by = "alpha"]
# plot it
ggplot(nt, aes(x = alpha, y = V1)) +
geom_point(color = "dodgerblue", size = 3) +
geom_line(color = "grey", linetype = "dashed") +
theme_bw() +
ttheme +
xlab("\u03B1") +
ylab("# of clusters") +
scale_x_continuous(breaks = seq(0, 1, by = 0.1))
It looks like there are four interesting points on the plot for this sample data (downsampled for top barcodes and 100 singlets from the WM989 high dose BRAF inhibitor replicate 1). I’ll annotate them with asterisks here:
From left to right, these points are the transcriptome clusters (alpha == 0), a nadir where addition of lineage information causes clusters to join (alpha == 0.2), the inflection point (alpha == 0.4), and the maximum value before clusters break apart into barcodes (alpha == 0.9).
Let’s make a sankey plot to see the reorganization, subsetted on our alphas of interest. You could do this on the full range of alphas we ran by omitting the brackets and their contents after clust in the next call, but 100+ nodes on a sankey requiring individual discrete colors is not easy to grok.
notable_alphas <- c(0, 0.2, 0.4, 0.9)
p <- Plot_alluvia(clust[alpha %in% notable_alphas],
bt,
title = "Sample Sankey",
xlab = "\u03B1", # unicode symbol for alpha
ylab = "# of cells",
border_size = 1,
label_nodes = FALSE, # labels nodes but hard to viz here
cols = BarCluster::c25
)
p[[2]] <- p[[2]] + ggtitle("Sample Sankey", subtitle = "Colored by \u03B1 = 0.9 clusters")
p[[1]]
p[[2]]
Let’s find the top cluster marker per cluster at each level. We will do this for ROC analysis for in vs out of cluster cells across all genes in the count matrix. I’ve written a function to make this easier. This is not fast however. If you already have a Seurat object, this may be quicker by input clusters as the active identity and using the Seurat::FindAllMarkers function. If you want to skip this step and just go right to the discovered markers, see the hashed annotations in this code.
# to skip this step:
# auc_table <- data.table::fread(file.path(dir, "YG1_markers.txt"))
auc_all <- Find_Markers_ROC(clust[alpha %in% notable_alphas], cm)
# you can save this table with:
# auc_table %>% data.table::fwrite("mytable.txt")
# what does it look like?
head(auc_all)
## rn auc thresh direction Group alpha
## 1: ACTA2 54.74387 -0.80533402 greater 10 0
## 2: MYLK 72.84893 0.07535532 greater 10 0
## 3: S100B 60.21109 -1.23971079 less 10 0
## 4: CRYAB 73.24235 -1.60858012 less 10 0
## 5: CCL5 64.60441 -0.96643621 less 10 0
## 6: ACTG2 63.16149 0.06621759 greater 10 0
# Take the best auc for any cluster at each alpha
auc_table <- auc_all[order(-auc)] %>% unique(by = c("rn", "alpha"))
head(auc_table)
## rn auc thresh direction Group alpha
## 1: IFIT3 98.78745 13.306560 greater 15 0.0
## 2: CKS2 98.65591 2.622634 greater 10 0.2
## 3: UBE2S 98.60983 4.128507 greater 10 0.2
## 4: IFIT1 98.32552 5.817067 greater 15 0.0
## 5: CCNB1 98.20917 3.886104 greater 10 0.2
## 6: MYLK 98.18751 14.881343 greater 3 0.0
It may be useful to look at how marker strength changes across hybrid clusters. To assess this, we will first take the best possible AUC for each marker at each alpha and make a wide table to look at strength.
# take our long format data and make it wide
aucw <- auc_table[, auc, by = c("alpha", "rn")] %>%
dcast(rn ~ alpha)
## Using 'auc' as value column. Use 'value.var' to override
# table of marker strength
head(aucw)
## rn 0 0.2 0.4 0.9
## 1: A2M 80.64516 67.80267 69.17082 69.17082
## 2: ABI3BP 70.22387 64.79895 68.64055 68.51136
## 3: AC104654.2 73.36854 70.31086 71.18387 71.18387
## 4: AC131025.8 80.57216 78.81208 79.13879 78.06223
## 5: ACTA2 97.15644 91.17165 94.24483 93.57567
## 6: ACTG2 84.63426 86.68160 85.64068 86.40088
Let’s identify some markers to plot it on Sankey.
aucw[, delta := `0` - `0.9`]
# marker that falls off the most
transcriptome_only <- aucw[order(-delta), rn[1]]
# marker that improves the most
alpha_enhanced <- aucw[order(delta), rn[1]]
print(c(transcriptome_only, alpha_enhanced))
## [1] "ZC3HAV1" "KCNMA1"
Let’s plot the transcriptome_only marker. We will make in cluster, marked cells in purple and out of cluster marked cells in grey.
# get the transcriptome cluster that the is the most effective for our marker
at <- auc_table[alpha == 0 & rn == transcriptome_only]
# get our marked cluster members
cluster_members <- clust[alpha == 0 & Group == at[, Group], rn]
# retrieve marked cells
all_pos <- rownames(cm)[cm[, transcriptome_only] > at[, thresh]]
# check if the sign is flipped for the AUC
if (at[, direction == "less"])
all_pos <- rownames(cm)[cm[, transcriptome_only] < at[, thresh]]
# get true positives (in cluster, positive)
tp <- intersect(cluster_members, all_pos)
# get false positives (out of cluster, positive)
fp <- all_pos[!all_pos %chin% cluster_members]
# generic plotting function, tracks any group of cells as a ribbon
p <- Plot_alluvia_track(clust[alpha %in% notable_alphas],
ids = list(fp, tp),
title = paste0(transcriptome_only, "-positivity"),
xlab = "\u03B1",
ylab = "# of cells",
label_nodes = FALSE,
border_size = 1,
flow_alpha = 1,
cols = c("grey80", "purple")
)
# put table in the same order as the plot
auc_table %<>% .[order(alpha)]
# AUC annotations
lab <- paste0("AUC: ", auc_table[rn == transcriptome_only]$auc %>%
`/`(.,100) %>% sprintf(fmt = "%.2f"))
# add AUC annotations
p <- p +
annotate("text", x = 1:4,
y = nrow(bt) * 1.1, label = lab, fontface = "bold", size = 6)
# add text theming
p <- p + ttheme + theme(plot.title = element_text(size = 12))
p
Let’s plot an alpha-enhanced marker this time. We will put in cluster, marked cells in purple and out of cluster marked cells in grey.
# get the hybrid cluster that the is the most effective for our marker
at <- auc_table[alpha == last(notable_alphas) & rn == alpha_enhanced]
# get our marked cluster members
cluster_members <- clust[alpha == last(notable_alphas) & Group == at[, Group], rn]
# retrieve marked cells
all_pos <- rownames(cm)[cm[, alpha_enhanced] > at[, thresh]]
# check if the sign is flipped for the AUC
if (at[, direction == "less"])
all_pos <- rownames(cm)[cm[, alpha_enhanced] < at[, thresh]]
tp <- intersect(cluster_members, all_pos)
fp <- all_pos[!all_pos %chin% cluster_members]
p <- Plot_alluvia_track(clust[alpha %in% notable_alphas],
ids = list(fp, tp),
title = paste0(alpha_enhanced, "-positivity"),
xlab = "\u03B1",
ylab = "# of cells",
label_nodes = FALSE,
border_size = 1,
flow_alpha = 1,
cols = c("grey80", "purple")
)
auc_table %<>% .[order(alpha)]
lab <- paste0("AUC: ", auc_table[rn == alpha_enhanced]$auc %>%
`/`(.,100) %>% sprintf(fmt = "%.2f"))
p <- p +
annotate("text", x = 1:4,
y = nrow(bt) * 1.1, label = lab, fontface = "bold", size = 6)
p <- p + ttheme + theme(plot.title = element_text(size = 12))
p
Let’s quickly plot the overall strength of top cluster markers across alpha values
# top cluster markers boxplot
topmarkers <- auc_all %>% split(by = c("alpha", "Group")) %>%
lapply(., function(t){
return(t[auc == max(auc)])
}) %>% data.table::rbindlist()
ggplot(topmarkers, aes(x = as.factor(alpha), y = auc)) +
geom_boxplot(fill = "dodgerblue") +
theme_bw() +
ttheme +
theme(plot.title = element_text(size = 12)) +
ggtitle("Top cluster markers") +
xlab("\u03B1") +
ylab("AUC as %") +
ylim(50,100)
Now let’s plot the maximum strength for any cluster of top markers over alpha.
m2b <- aucw[rn %chin% topmarkers[, rn], .SD, .SDcols = c("rn", as.character(notable_alphas))]
m <- dt2m(m2b)
pheatmap::pheatmap(m, cluster_cols = FALSE)
For the last part of our analysis, lets explore our barcodes and clusters in UMAP space. We will start by visualizing our clustering alphas in default UMAP space.
TODO show wf levels, pick an optimal tr and low alpha clusters at 0 and pick some other wf, then show a few barcodes done
# lets do multiple warp factors
wfs <- c(0, 2, 4, 6, 8, 10)
# get our warped UMAPs
umaps <- lapply(wfs, function(s){
mo <- barcode_warp(pca, bt, s)
um <- umap_matrix(mo)
um[, warp := s]
}) %>% data.table::rbindlist()
# add our barcodes to the table
umaps <- merge(umaps, bt, by = "rn")
# color by barcode and put singlets into one category
umaps[, Barcode :=
ifelse(rn %>% unique %>% length > 1, Barcode, "Singlet"),
by = "Barcode"]
ums <- lapply(umaps[, Barcode %>% unique], function(g){
ump <- umaps %>% data.table::copy()
ump[Barcode == g, col := "00"]
ump[Barcode != g, col := "ZZ"]
um <- ggplot(ump, aes(x = UMAP_1, y = UMAP_2)) +
geom_point(aes(col = col), size = 0.3, alpha = 0.5) +
facet_wrap(~warp) +
scale_color_manual(values = c("grey", cw_colors[2])) +
ttheme +
theme_void() +
theme(legend.position = "none")
return(um)
})
Let’s have a look at how our UMAP looks with increasing warp factor for each barcode:
ums
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
##
## [[15]]
##
## [[16]]